Abstract

We present a general method to detect and extract from a finite 
time sample statistically meaningful correlations between input and 
output variables of large dimensionality. Our central result is derived 
from the theory of free random matrices, and gives an explicit expres- 
sion for the interval where singular values are expected in the absence 
of any true correlations between the variables under study. Our re- 
sult can be seen as the natural generalization of the Marcenko-Pastur 
distribution for the case of rectangular correlation matrices. We il- 
lustrate the interest of our method on a set of macroeconomic time 
series. 

1 Introduction 

Finding correlations between observables is at the heart of scientific methodology.
Once correlations between "causes" and "effects" are empirically established,
one can start devising theoretical models to understand the mechanisms
underlying such correlations, and use these models for prediction
purposes. In many cases, the number of possible causes and of resulting effects
are both large. For example, one can list an a priori large number of
environmental factors possibly favoring the appearance of several symptoms 
or diseases, or of social/educational factors determining choices and tastes 
on different topics. A vivid example is provided by Amazon.com, where taste 
correlations between a huge number of different products (books, CDs, etc.) 
are sought for. In the context of gene expression networks, the number of 
input and output chemicals and proteins, described by their concentration, is 
very large. In an industrial setting, one can monitor a large number of char- 
acteristics of a device (engine, hardware, etc.) during the production phase 
and correlate these with the performances of the final product. In economics 
and finance, one aims at understanding the relation between a large number 
of possibly relevant factors, such as interest and exchange rates, industrial 
production, confidence index, etc. on, say, the evolution of inflation in different
sectors of activity [1], or on the price of different stocks. Nowadays,
the number of macroeconomic time series available to economists is huge (see
below). This has led Granger f| and others to suggest that "large models"
should be at the forefront of the econometrics agenda. The theoretical study
of high dimensional factor models is indeed actively pursued jSl IH 13 El El C],
in particular in relation with monetary policy E].

In the absence of information on the phenomenon under study, a brute 
force strategy would consist in listing a large number of possible explanatory 
variables and a large number of output variables, and systematically look 
for correlations between pairs, in the hope of finding some significant signal. 
In an econometric context, this is the point of view advocated long ago by
Sims [9], who suggested looking at large Vector Autoregressive models, and
letting the system itself determine the number and the nature of the relevant
variables. However, this procedure is rapidly affected by the "dimensionality
curse", also called the problem of sunspot variables in the economics literature
JU]. Since the number of observations is always limited, it can happen
that two totally unrelated phenomena (such as, for example, stock prices
and sunspots) appear to be correlated over a certain time interval T. More
precisely, the correlation coefficient $\rho$, which would (presumably) be zero if
very long time series could be studied, is in fact of the order of $1/\sqrt{T}$ and
can be accidentally large. When one tries to correlate systematically N input
variables with M output variables, the number of pairs is NM. In the
absence of any true correlation between these variables, the largest of these
NM empirical correlation coefficients will be, for Gaussian variables, of order
$\rho_{\max} \sim \sqrt{2 \ln(NM)/T}$, which grows with NM. For example, $\rho_{\max} \approx 0.25$
for N = M = 25 and T = 200. If the input and output variables are non
Gaussian and have fat tails, this number can be even larger: if two strongly
fluctuating random variables accidentally take large values simultaneously,
this will contribute a lot to the empirical correlation even though $\rho$ should
be zero for large T.
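
As a rough illustration of this order-of-magnitude estimate, the following sketch evaluates $\sqrt{2 \ln(NM)/T}$ for the example quoted above and compares it with the largest pair correlation found between independent Gaussian series. The use of Python with NumPy, the function names and the random seed are our own assumptions, not part of the original analysis.

```python
import numpy as np

def rho_max_estimate(N, M, T):
    """Order-of-magnitude estimate sqrt(2 ln(NM) / T) quoted in the text."""
    return np.sqrt(2.0 * np.log(N * M) / T)

def largest_spurious_correlation(N, M, T, seed=0):
    """Largest |pair correlation| between N and M independent Gaussian series."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, T))
    Y = rng.standard_normal((M, T))
    # Standardize each series over the sample (zero mean, unit variance).
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    Y = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)
    rho = Y @ X.T / T  # M x N matrix of empirical pair correlations
    return np.abs(rho).max()

estimate = rho_max_estimate(25, 25, 200)               # about 0.25
simulated = largest_spurious_correlation(25, 25, 200)  # depends on the seed
```

The simulated maximum fluctuates from sample to sample, so only its order of magnitude should be compared with the estimate.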

In this paper we want to discuss how recent results in Random Matrix
Theory ^TJ allow one to alleviate this dimensionality curse and give a
precise procedure to extract significant correlations between N input variables
and M output variables, when the number of independent observations
is T. The idea is to compare the singular value spectrum of the empirical
rectangular $M \times N$ correlation matrix with a benchmark, obtained by assuming
no correlation at all between the variables. For $T \to \infty$ at N, M fixed,
all singular values should be zero, but this will not be true if T is finite.
The singular value spectrum of this benchmark problem can in fact be computed
exactly in the limit where $N, M, T \to \infty$, with the ratios m = M/T
and n = N/T held fixed. As usual with Random Matrix problems [11, 12], the
singular value spectrum develops sharp edges in the asymptotic limit which
are to a large extent independent of the distribution of the elements of the
matrices. Any singular value observed to be significantly outside these edges
can therefore be deemed to carry some relevant information. A similar solution
has been known for a long time for standard correlation matrices, for
example the correlations of the N input variables between themselves that
define an $N \times N$ symmetric matrix. In this case, the benchmark is known as
the Wishart ensemble, and the relevant eigenvalue spectrum is given by the
Marcenko-Pastur distribution [T3J EH US]. Applications of this method to financial
correlation matrices are relatively recent ^H] but very active fTJ EB].
Comparing the empirical eigenvalues of the correlation matrix to the theoretical
upper edge of the Marcenko-Pastur spectrum allows one to extract
statistically significant factors [16] (although some may also be buried below
the band edge, see [IZj). Similar ideas are starting to be discussed in the
econometric community, in particular to deal with the problem of identifying
the relevant factors in large dynamical factor models and using them
for prediction purposes (see also jH] for a different point of view). Here, we
extend the Marcenko-Pastur result to general rectangular, non-equal time
correlation matrices. We will first present a precise formulation of our central
result, which we will then illustrate using an economically relevant data
set, and finally discuss some possible extensions of our work.

2 Mathematical formulation of the problem 

We will consider N input factors, denoted $X_a$, $a = 1, \dots, N$, and M output
factors $Y_\alpha$, $\alpha = 1, \dots, M$. There is a total of T observations, where both $X_{at}$
and $Y_{\alpha t}$, $t = 1, \dots, T$, are observed. We assume that all N + M time series are
standardized, i.e., both X's and Y's have zero mean and unit variance. The
X's and the Y's may be completely different, or be the same set of observables
but observed at different times, as for example N = M and $Y_{\alpha t} = X_{\alpha, t+1}$.
From the set of X's and Y's one can form two correlation matrices, $C_X$ and
$C_Y$, defined as:

$$(C_X)_{ab} = \frac{1}{T} \sum_{t=1}^{T} X_{at} X_{bt}, \qquad (C_Y)_{\alpha\beta} = \frac{1}{T} \sum_{t=1}^{T} Y_{\alpha t} Y_{\beta t}. \qquad (1)$$

In general, the X's (and the Y's) have no reason to be independent of each
other, and the correlation matrices $C_X$ and $C_Y$ will contain information on
their correlations. As alluded to above, one can diagonalize both these matrices;
provided T > N, M (which we will assume in the following), all
eigenvalues will, in generic cases, be strictly positive. In certain cases, some
eigenvalues will however lie close to zero, much below the lower edge of the
Marcenko-Pastur interval, corresponding to redundant variables which may
need to be taken care of (see below). Disregarding this problem for the moment,
we use the corresponding eigenvectors to define a set of uncorrelated,
unit variance input variables $\hat X$ and output variables $\hat Y$. For example,

$$\hat X_{at} = \frac{1}{\sqrt{T \lambda_a}} \sum_b V_{ab} X_{bt}, \qquad (2)$$

where $\lambda_a$ is the a-th eigenvalue of $C_X$ and $V_{ab}$ the components of the corresponding
eigenvector. Now, by construction, $C_{\hat X} = \hat X \hat X^T$ and $C_{\hat Y} = \hat Y \hat Y^T$ are
exactly identity matrices, of dimension N and M, respectively. Using general
properties of diagonalisation, this means that the $T \times T$ matrices $D_X = \hat X^T \hat X$
and $D_Y = \hat Y^T \hat Y$ have exactly N (resp. M) eigenvalues equal to 1 and $T - N$
(resp. $T - M$) equal to zero.
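
The construction of Eq. (2) can be sketched numerically as follows. This is only an illustration under our own assumptions (Python with NumPy, a hypothetical helper named `whiten`, toy Gaussian data); it checks that the transformed variables have an identity correlation matrix and that the $T \times T$ matrix $D_X$ has N eigenvalues equal to one and $T - N$ equal to zero.

```python
import numpy as np

def whiten(X):
    """Return X_hat (N x T) such that X_hat @ X_hat.T is the identity.

    X is an N x T array of time series with T > N.
    """
    N, T = X.shape
    C = X @ X.T / T                  # empirical correlation matrix, Eq. (1)
    lam, V = np.linalg.eigh(C)       # columns of V are the eigenvectors of C
    # Eq. (2): project on each eigenvector and rescale by 1 / sqrt(T * lambda_a).
    return (V.T @ X) / np.sqrt(T * lam)[:, None]

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 50))     # toy data: N = 5 series, T = 50 observations
X_hat = whiten(X)
D_X = X_hat.T @ X_hat                # T x T matrix, a projector of rank N
eigs = np.linalg.eigvalsh(D_X)       # sorted in increasing order
```

Note that `numpy.linalg.eigh` stores eigenvectors as columns, so the matrix $V_{ab}$ of Eq. (2) corresponds to `V.T` in this sketch.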

Now, consider the $M \times N$ cross-correlation matrix G between the $\hat X$'s
and the $\hat Y$'s:

$$(G)_{\alpha b} = \sum_{t=1}^{T} \hat Y_{\alpha t} \hat X_{bt} \equiv \hat Y \hat X^T. \qquad (3)$$

The singular value decomposition (SVD) of this matrix answers the following
question [20]: what is the (normalised) linear combination of X's on
the one hand, and of Y's on the other hand, that have the strongest mutual
correlation? In other words, what is the best pair of predictor and
predicted variables, given the data? The largest singular value $s_{\max}$ and its
corresponding left and right eigenvectors answer precisely this question: the
eigenvectors tell us how to construct these optimal linear combinations, and
the associated singular value gives us the strength of the cross-correlation.
One can now restrict both the input and output spaces to the $N - 1$ and
$M - 1$ dimensional sub-spaces orthogonal to the two eigenvectors, and repeat
the operation. The list of singular values $s_a$ gives the prediction power, in
decreasing order, of the corresponding linear combinations.
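
The following sketch illustrates this reading of the SVD on toy data. It is an assumption-laden example rather than the procedure applied to the data below: the helper `whiten`, the sizes and the planted dependence between one input and one output are all hypothetical.

```python
import numpy as np

def whiten(A):
    """Eq. (2): rotate to the eigenbasis of A A^T / T and rescale to unit variance."""
    T = A.shape[1]
    lam, V = np.linalg.eigh(A @ A.T / T)
    return (V.T @ A) / np.sqrt(T * lam)[:, None]

rng = np.random.default_rng(2)
T, N, M = 400, 6, 3
X = rng.standard_normal((N, T))
Y = rng.standard_normal((M, T))
Y[0] = 0.8 * X[0] + 0.6 * rng.standard_normal(T)   # planted (hypothetical) dependence

X_hat, Y_hat = whiten(X), whiten(Y)
G = Y_hat @ X_hat.T                                 # M x N matrix of Eq. (3)
U, s, Vt = np.linalg.svd(G, full_matrices=False)    # s is sorted in decreasing order

# Best pair of predicted / predictor combinations: unit-norm series of length T.
predicted = U[:, 0] @ Y_hat
predictor = Vt[0] @ X_hat
strength = predicted @ predictor                    # equals the top singular value s[0]
```

Because $\hat X \hat X^T$ and $\hat Y \hat Y^T$ are identity matrices, all singular values of G lie between 0 and 1 and can be read directly as correlation coefficients.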



3 Singular values from free random matrix 
theory 

How to get these singular values? If M < N, the trick is to consider the
$M \times M$ matrix $GG^T$ (or the $N \times N$ matrix $G^T G$ if M > N), which
is symmetric and has M positive eigenvalues, each of which is equal to
the square of a singular value of G itself. The second observation is that the
non-zero eigenvalues of $GG^T = \hat Y \hat X^T \hat X \hat Y^T$ are the same as those of the $T \times T$
matrix $D = D_X D_Y = \hat X^T \hat X \hat Y^T \hat Y$, obtained by swapping the position of $\hat Y$
from first to last. In the benchmark situation where the X's and the Y's are
independent from each other, the two matrices $D_X$ and $D_Y$ are mutually free
[11] and one can use results on the product of free matrices to obtain the
eigenvalue density from that of the two individual matrices, which are known.
The general recipe [TH ITTj is to construct first the so-called $\eta$-transform of
the eigenvalue density $\rho(u)$ of a given $T \times T$ non negative matrix A, defined
as:

$$\eta_A(\gamma) \equiv \int du\, \rho(u) \frac{1}{1 + \gamma u} = \frac{1}{T} \mathrm{Tr}\, \frac{1}{1 + \gamma A}. \qquad (4)$$

From the functional inverse of $\eta_A$, one now defines the $\Sigma$-transform of A as:

$$\Sigma_A(x) \equiv -\frac{1 + x}{x}\, \eta_A^{-1}(1 + x). \qquad (5)$$

Endowed with these definitions, one of the fundamental theorems of Free
Matrix Theory [11] states that the $\Sigma$-transform of the product of two free
matrices A and B is equal to the product of the two $\Sigma$-transforms. [A similar,
somewhat simpler, theorem exists for sums of free matrices, in terms of "R-transforms",
see [H]]. Applying this theorem with $A = D_X$ and $B = D_Y$,
one finds:

$$\eta_A(\gamma) = 1 - n + \frac{n}{1 + \gamma}, \qquad \eta_B(\gamma) = 1 - m + \frac{m}{1 + \gamma}. \qquad (6)$$

From this, one easily obtains:

$$\Sigma_D(x) = \Sigma_A(x) \Sigma_B(x) = \frac{(1 + x)^2}{(x + n)(x + m)}. \qquad (7)$$

Inverting back this relation allows one to derive the $\eta$-transform of D as:

$$\eta_D(\gamma) = \frac{1}{2(1 + \gamma)} \left[ 1 - (\mu + \nu)\gamma + \sqrt{(\mu - \nu)^2 \gamma^2 - 2(\mu + \nu + 2\mu\nu)\gamma + 1} \right], \qquad (8)$$

with $\mu = m - 1$ and $\nu = n - 1$. The limit $\gamma \to \infty$ of this quantity gives
the density of exactly zero eigenvalues, easily found to be equal to $\max(1 - n, 1 - m)$,
meaning, as expected, that the number of non zero eigenvalues of
D is $\min(N, M)$. Depending on the value of n + m compared to unity, the
pole at $\gamma = -1$ corresponding to eigenvalues exactly equal to one has a zero
weight (for n + m < 1) or a non zero weight equal to $n + m - 1$. One can
re-write the above result in terms of the more common Stieltjes transform of
D, $S(z) \equiv \eta(-1/z)/z$, which reads:

$$S_D(z) = \frac{1}{2z(z - 1)} \left[ z + (\mu + \nu) + \sqrt{(\mu - \nu)^2 + 2(\mu + \nu + 2\mu\nu) z + z^2} \right]. \qquad (9)$$

The density of eigenvalues $\rho_D(z)$ is then obtained from the standard relation [11]:

$$\rho_D(z) = \lim_{\epsilon \to 0} \frac{1}{\pi} \Im \left[ \frac{1}{T} \mathrm{Tr}\, \frac{1}{z + i\epsilon - D} \right] = \lim_{\epsilon \to 0} \frac{1}{\pi} \Im \left[ S_D(z + i\epsilon) \right], \qquad (10)$$

which leads to the rather simple final expression, which is the central result
of this paper, for the density of singular values s of the original correlation
matrix $G = \hat Y \hat X^T$:

$$\rho(s) = \max(1 - n, 1 - m)\, \delta(s) + \max(m + n - 1, 0)\, \delta(s - 1) + \frac{\Re \sqrt{(s^2 - \gamma_-)(\gamma_+ - s^2)}}{\pi s (1 - s^2)}, \qquad (11)$$

where $\gamma_\pm$ are the two positive roots of the quadratic expression under the
square root in Eq. (9) above, which read explicitly (see footnote 1):

$$\gamma_\pm = n + m - 2mn \pm 2\sqrt{mn(1 - n)(1 - m)}. \qquad (12)$$

Figure 1: Continuous part of the theoretical random singular value spectrum $\rho(s)$
for different values of n and m. Note that for n = m the spectrum extends down
to s = 0, whereas for $n + m \to 1$, the spectrum develops a $(1 - s)^{-1/2}$ singularity,
just before the appearance of a $\delta$ peak at s = 1 of weight $n + m - 1$.

This is our main technical result, illustrated in Fig. 1. One can check that
in the limit $T \to \infty$ at fixed N, M, all singular values collapse to zero, as
they should since there are no true correlations between X and Y; the allowed
band in the limit $n, m \to 0$ becomes:

$$s \in \left[ |\sqrt{m} - \sqrt{n}|,\ \sqrt{m} + \sqrt{n} \right]. \qquad (13)$$

When $n \to m$, the allowed band becomes $s \in [0, 2\sqrt{m(1 - m)}]$ (plus a $\delta$
function at s = 1 when n + m > 1), while when m = 1, the whole band
collapses to a $\delta$ function at $s = \sqrt{1 - n}$. When $n + m \to 1^-$, the inchoate
$\delta$-peak at s = 1 is announced as a singularity of $\rho(s)$ diverging as $(1 - s)^{-1/2}$.
Finally, when $m \to 0$ at fixed n, one finds that the whole band collapses
again to a $\delta$ function at $s = \sqrt{n}$. This last result can be checked directly
in the case one has one output variable (M = 1) that one tries to correlate
optimally with a set of N independent time series of length T. The result can
easily be shown to be a correlation of $\sqrt{N/T}$. A plot of the SV density $\rho(s)$
for values of m and n which will be used below is shown in Fig. 2, together
with a numerical determination of the SVD spectrum of two independent
vector time series X and Y, after suitable diagonalisation of their empirical
correlation matrices to construct their normalised counterparts, $\hat X$ and $\hat Y$.
The agreement with our theoretical prediction is excellent.
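
A numerical check of this kind can be sketched as follows. The code is a minimal illustration under our own assumptions (Python with NumPy, hypothetical function names, Gaussian entries, and sizes chosen to reproduce the ratios n = 76/265 and m = 35/265 of Fig. 2 with T = 2650); it is not the simulation code used for the figure.

```python
import numpy as np

def band_edges(n, m):
    """gamma_- and gamma_+ of Eq. (12); the band in s is [sqrt(gamma_-), sqrt(gamma_+)]."""
    centre = n + m - 2.0 * m * n
    half_width = 2.0 * np.sqrt(m * n * (1.0 - n) * (1.0 - m))
    return centre - half_width, centre + half_width

def continuous_density(s, n, m):
    """Continuous part of Eq. (11) for 0 < s < 1; its total weight is not one,
    since the delta peaks of Eq. (11) carry the remaining weight."""
    g_minus, g_plus = band_edges(n, m)
    s = np.asarray(s, dtype=float)
    inside = np.clip((s**2 - g_minus) * (g_plus - s**2), 0.0, None)
    return np.sqrt(inside) / (np.pi * s * (1.0 - s**2))

def random_singular_values(N, M, T, seed=0):
    """Singular values of Y_hat X_hat^T for independent Gaussian X (N x T), Y (M x T)."""
    rng = np.random.default_rng(seed)

    def whiten(A):
        lam, V = np.linalg.eigh(A @ A.T / T)
        return (V.T @ A) / np.sqrt(T * lam)[:, None]

    X_hat = whiten(rng.standard_normal((N, T)))
    Y_hat = whiten(rng.standard_normal((M, T)))
    return np.linalg.svd(Y_hat @ X_hat.T, compute_uv=False)

n, m = 76 / 265, 35 / 265
upper_edge = np.sqrt(band_edges(n, m)[1])             # about 0.806
s_sim = random_singular_values(N=760, M=350, T=2650)  # same ratios n and m
```

A histogram of `s_sim`, divided by T rather than by the number of singular values, can then be compared with `continuous_density`; the simulated values should fall inside the band up to finite-size effects near the edges.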

Note that one could have considered a different benchmark ensemble,
where the independent vector time series X and Y are not diagonalized and
transformed into $\hat X$ and $\hat Y$ before SVD. The direct SVD spectrum in that
case can also be computed as the S-convolution of two Marcenko-Pastur distributions
with parameters m and n, respectively (noted MP$^2$ in Fig. 2).
The result, derived in the Appendix, is noticeably different from the above
prediction (see Fig. 1). This alternative benchmark ensemble is however not
well suited for our purpose, because it mixes up the possibly non trivial correlation
structure of the input variables and of the output variables themselves
with the issue at stake here, namely the cross-correlations between input and
output variables.

Footnote 1: One can check that $\gamma_+ \leq 1$ for all values of n, m < 1. The upper bound is reached
only when n + m = 1, in which case the upper edge of the singular value band touches
s = 1.

Figure 2: Random Singular Value spectrum $\rho(s)$ for m = 35/265 and n =
76/265. We show two possible theoretical calculations, corresponding either to
bare random vectors X and Y, for which the singular value spectrum is related
to the 'square' (in the free convolution sense) of the Marcenko-Pastur distribution
MP$^2$, or standardized vectors $\hat X$ and $\hat Y$, obtained after diagonalizing the empirical
correlation matrices of X and Y. We also show the results of a numerical simulation
of the standardized case with T = 2650.

Figure 3: Histogram of the pair correlation coefficient $\rho$ between X's and Y's,
both at equal times and with one month lag. Note the 'island' of correlations
around $\rho \approx 0.6$ for one-month lagged correlations, which corresponds to correlations
between oil prices and energy related CPIs one month later. We also show a
Gaussian of variance 1/T, expected in the absence of any correlations.



4 Application: inflation vs. economic indicators

We now turn to the analysis of a concrete example. We investigate how dif- 
ferent groups of US inflation indexes can be explained using combinations of 
indicators belonging to different economic sectors. As "outputs" Y, we use 34 
indicators of inflation, the monthly changes of the Composite Price Indexes 
(CPIs), concerning different sectors of activity including and excluding com- 
modities. These indexes were not selected very carefully and some are very 
redundant, since the point of our study is to show how the proposed method
is able to select the relevant variables by itself. As explanatory variables, "inputs"
X, we use 76 different macroeconomic indicators from the following
categories: industrial production, retail sales, new orders and inventory in- 
dexes of all available economic activity sectors, the most important consumer 
and producer confidence indexes, new payrolls and unemployment difference, 
interest rates (3 month, 2 and 10 years), G7 exchange rates against the Dol- 
lar and the WTI oil price itself. The total number of months in the period 
June 1983- July 2005 is 265. We want to see whether there is any signifi- 
cant correlation between changes of the CPIs and of the economic indexes, 
either simultaneous or one month ahead. We also investigated two-month 
lag correlations, for which we found very little signal. 

We first standardize the time series Y and X and form the rectangular
correlation matrix between these two quantities, containing 34 x 76 numbers
in the interval [-1, 1]. The distribution of these pair correlations is shown in
Fig. 3, both for equal-time $Y_t X_t^T$ and for one-month lagged $Y_t X_{t-1}^T$ correlations,
and compared to a Gaussian distribution of variance $T^{-1}$. We see that
the empirical distributions are significantly broader; in particular an 'island'
of correlations around $\rho \approx 0.6$ appears for the one-month lagged correlations.
These correspond to correlations between oil prices and energy related CPIs
one month later. The question is whether there are other predictable modes
in the system; in particular, are the correlations in the left and right flanks
of the central peak meaningful or not? This question is a priori non trivial
because the kurtosis of some of the variables is quite high, which is expected
to 'fatten' the distribution of $\rho$ compared to the Gaussian. Within the period
of about thirty years covered by our time series, three major rare events happened:
the Gulf War (1991-92), the Asian crisis (1998), and the Twin Towers
Attack (2001). The kurtosis of the CPIs is the trace of the corresponding
outliers, such as the food price index and its 'negative', the production price
index excluding food, which are strongly sensitive to war events. Among
economic indicators, the most responsive series to these events appear to be
the inventory-sales ratio, the manufacturing new orders and the motor and
motor parts industrial production indexes.

In order to answer precisely the above question, we first turn to the analysis
of the empirical self-correlation matrices $C_X$ and $C_Y$. We diagonalize them
and compare their eigenvalues in Fig. 4 to the corresponding Marcenko-Pastur
distributions, expected if the variables were independent
(see the Appendix for more details). Since both the input and output
variables are in fact rather strongly correlated at equal times, it is not surprising
to find that some large eigenvalues $\lambda$ emerge from the Marcenko-Pastur
noise band: for $C_X$, the largest eigenvalue is $\approx 15$, to be compared to the
theoretical upper edge of the Marcenko-Pastur distribution 2.358, whereas
for $C_Y$ the largest eigenvalue is $\approx 6.2$ to be compared with 1.858. But the
most important point for our purpose is the rather large number of very small
eigenvalues, much below the lower edge of the Marcenko-Pastur distribution
($\lambda_{\min} = 0.215$ for $C_X$, see Fig. 4). These correspond to linear combinations of
redundant (strongly correlated) indicators. Since the definitions of $\hat X$ and $\hat Y$
include a factor $1/\sqrt{\lambda}$ (see Eq. (2)), the eigenvectors corresponding to these
small eigenvalues have an artificially enhanced weight. One expects this to
induce some extra noise in the system, as will indeed be clear below. Having
constructed the set of strictly uncorrelated, unit variance input $\hat X$ and
output $\hat Y$ variables, we determine the singular value spectrum of $G = \hat Y \hat X^T$.
If we keep all variables, this spectrum is in fact indistinguishable from pure
noise when X precedes Y by one month, and only one singular value emerges
($s_{\max} \approx 0.87$ instead of the theoretical value 0.806) when X and Y are simultaneous.

If we now remove redundant, noisy factors that correspond to, say, $\lambda < \lambda_{\min}/2 \approx 0.1$
both in $C_X$ and $C_Y$, we reduce the number of factors to 50 for
X and 16 for Y (see footnote 2). The cumulative singular value spectrum of this cleaned
problem is shown in Fig. 5 and compared again to the corresponding random
benchmark. In this case, both for the simultaneous and lagged cases,
the top singular values $s_{\max} \approx 0.73$ (resp. $s_{\max} \approx 0.81$) are very clearly above
the theoretical upper edge $s_{ue} \approx 0.642$, indicating the presence of some true
correlations. The top singular value $s_{\max}$ rapidly sticks onto the theoretical
edge as the lag increases. For the one-month lagged case, there might
be a second meaningful singular value at s = 0.66. The structure of the
corresponding eigenvectors allows one to construct a linear combination of
economic indicators explaining a linear combination of CPI series. The
combination of economic indicators corresponding to the top singular value
reveals the main economic factors affecting inflation indicators: oil prices,
obviously correlated with energy production increases and electricity production
decreases, which explain the CPI indexes including oil and energy. The
second factor includes the next important elements of the economy: employment
(new payrolls) directly affects the "core" indexes and the CPI indexes
excluding oil. New economy production (high tech, media & communication)
is actually a proxy for productivity increases, and therefore exhibits
a negative correlation with the same core indexes. We have also computed
the inverse participation ratio of all left and right eigenvectors with similar
conclusions: all eigenvectors have a participation ratio close to the informationless
Porter-Thomas result, except those corresponding to singular
values above the upper edge.

Footnote 2: The results we find are however weakly dependent on the choice of this lower cut-off,
provided very small $\lambda$'s are removed.

Figure 4: Eigenvalue spectrum of the $N \times N$ correlation matrix of the input
variables $C_X$, compared to the Marcenko-Pastur distribution with parameter
n = 76/265. Clearly, the fit is very bad, meaning that the input variables are
strongly correlated; the top eigenvalue $\lambda_{\max} \approx 15$ is in fact not shown. Note the
large number of very small eigenvalues corresponding to combinations of strongly
correlated indicators, that are pure noise but have a small volatility.
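
The cleaning step described above, in which eigen-directions with very small eigenvalues are discarded before normalisation, could be sketched as follows. This is an illustration under our own assumptions: the helper names, the toy data with one deliberately redundant series, and the hard cut-off at 0.1 are hypothetical choices, not the code used for the macroeconomic data.

```python
import numpy as np

def standardize(A):
    """Give every row of A zero mean and unit variance over the sample."""
    A = A - A.mean(axis=1, keepdims=True)
    return A / A.std(axis=1, keepdims=True)

def whiten_cleaned(A, lam_cut=0.1):
    """Whiten A (K x T), keeping only the eigen-directions of A A^T / T
    whose eigenvalue is at least lam_cut."""
    T = A.shape[1]
    lam, V = np.linalg.eigh(A @ A.T / T)
    keep = lam >= lam_cut
    return (V[:, keep].T @ A) / np.sqrt(T * lam[keep])[:, None]

rng = np.random.default_rng(3)
T = 265
X = rng.standard_normal((6, T))
# Sixth series is almost a copy of a combination of the first two (redundant).
X[5] = (X[0] + X[1]) / np.sqrt(2) + 0.01 * rng.standard_normal(T)
X = standardize(X)

X_hat = whiten_cleaned(X, lam_cut=0.1)   # one redundant direction is removed
n_clean = X_hat.shape[0] / T             # ratio to use in the random benchmark
```

After cleaning, the ratios n and m entering the random benchmark are those of the retained factors, as in the values n = 50/265 and m = 16/265 quoted in the text.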

Since $Y_{t-1}$ may also contain some information to predict $Y_t$, one could
also study, in the spirit of general Vector Autoregressive Models [B1E1II], the
case where we consider the full vector of observables Z of size 111, obtained
by merging together X and Y. We again define the normalised vector $\hat Z$,
remove all redundant eigenvalues of $ZZ^T$ smaller than 0.1, and compute the
singular value spectrum of $\hat Z_t \hat Z_{t-1}^T$. The size of this cleaned matrix is 62 x 62,
and the upper edge of the random singular value spectrum is $s_{ue} \approx 0.84$. We
now find that the top singular value is at $s_{\max} \approx 0.97$, and that $\approx 8$ factors
have singular values above the upper edge of the random spectrum. The
top singular value corresponds to sales and inventory/sales ratio, followed by
CPIs that tend to be correlated over time. Further results are less intuitively
simple. This analysis can of course be generalized to larger lags, by studying
$\hat Z_t \hat Z_{t-n}^T$. We find that even for n = 4, there are still three singular values
above the upper edge. The SVD results are therefore of great help to rank
the importance of autocorrelations of degree n in the system; we will explore
this point further in a future publication.

Figure 5: Cumulative singular value distribution for the "cleaned" problem, i.e.
removing the factors with very small volatilities, leaving 50 factors in X and 16 in
Y. The correlations we consider are lagged and correspond to $Y_t X_{t-1}^T$. The filled
circles correspond to the 16 empirical singular values, and the plain line is the
theoretical prediction in the purely random case with n = 50/265 and m = 16/265.
Note that the top singular value $s_{\max} \approx 0.81$ clearly stands out of the noise band,
the edge of which is at $s_{ue} = 0.642$. Finite T corrections are expected to smooth
the edge over a region of size $T^{-2/3} \approx 0.025$ for T = 265.

5 Conclusions and extensions 

The conclusions of this illustrative empirical study are twofold: (i) in gen- 
eral, both input and output variables have a non trivial correlation structure, 
with many redundant factors which add a significant amount of noise in the 
problem. Therefore, in a first step, some data cleaning must be performed by 
eliminating these redundant variables; (ii) the singular value spectrum, com- 
pared to its purely random counterpart, allows one to answer precisely the 
question of the number and relevance of independent predictable factors in 
the problem under study. In the case considered, we have seen that although 
the number of pairs of apparently correlated factors is large (see Fig. 3), only 
a few modes can in fact be considered as containing useful information, in 
the sense that their singular value exceeds our analytical upper edge given 
in Eq. (11). When studying the full problem where all variables are treated
together, we find that the effective dimensionality of the problem drops from 
111 to eight or so independent, predictable factors. This compares quite well 
with the number seven quoted by Stock and Watson within their dynamical 
factor analysis of a similar data set pQ . A more thorough comparison of our 
results with those of the econometrics literature will be presented elsewhere. 

What we mean by 'exceed the upper edge' should of course be specified
more accurately, beyond the eye-balling procedure that we implicitly rely
on. In order to have a more precise criterion, one should study the statistics
of the top eigenvalue of D, which is, in analogy with the known results for
the Wishart ensemble, most probably given by a Tracy-Widom distribution,
at least for Gaussian random variables (see [2U 123 for recent progress and
references). For finite T, we expect the top eigenvalue of D to ooze away
from the theoretical edge by a quantity of order $T^{-2/3} \approx 0.025$ for T =
265. Therefore, the difference between $s_{\max} \approx 0.81$ and the theoretical edge
$s_{ue} = 0.642$ reported in Fig. 5 can safely be considered as significant when
all variables are Gaussian. However, although the density of singular values
is to a large degree independent of the distribution of the matrix entries,
one should expect that the fuzzy region around the theoretical edge expands
significantly if the input and output variables have fat tails. In particular,
the Tracy-Widom distribution is expected to break down in some way that
would be very interesting to characterize precisely. We leave this problem to
future investigations.
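
The arithmetic behind this significance argument is short enough to spell out. The snippet below is only a back-of-the-envelope sketch in Python (our choice of language and variable names); it uses the numbers quoted in the text and the $T^{-2/3}$ scale as a rough yardstick, not a Tracy-Widom test.

```python
T = 265
edge_width = T ** (-2.0 / 3.0)      # expected smoothing of the edge, about 0.024
s_max, s_edge = 0.81, 0.642         # values quoted for the lagged, cleaned problem (Fig. 5)
excess_in_widths = (s_max - s_edge) / edge_width
```

The top singular value thus lies several edge widths above the random benchmark, which is what makes the Gaussian case unambiguous.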

In conclusion, we have presented a general method to extract statistically 
meaningful correlations between an arbitrary collection of input and output 
variables of which only a finite time sample is available. Our central result 
is derived from the theory of free random matrices, and gives an explicit 
expression for the interval where singular values are expected in the absence 
of any true correlations between the variables under study. Our result can 
be seen as the natural generalization of the Marcenko-Pastur distribution 
for the case of rectangular correlation matrices. The potential applications 
of this method are quite numerous and we hope that our results will prove 
useful in different fields where multivariate correlations are relevant. 

Acknowledgments: We wish to thank Gerard Ben Arous, Florent Benaych- 
Georges and Jack Silverstein for most useful discussions on Random Matrix 
Theory. 

Appendix: the MP$^2$ case

As indicated in the main text, one could have chosen as a benchmark the
case where all (standardized) variables X and Y are uncorrelated, meaning
that the ensemble averages $E(C_X) = E(XX^T)$ and $E(C_Y) = E(YY^T)$ are
equal to the unit matrix, whereas the ensemble average cross-correlation
$E(G) = E(YX^T)$ is identically zero. However, for a given finite size sample,
the eigenvalues of $C_X$ and $C_Y$ will differ from unity, and the singular values
of G will not be zero. The statistics of the eigenvalues of $C_X$ and $C_Y$ is well
known to be given by the Marcenko-Pastur distribution with parameters n
and m respectively, which reads, for $\beta = n, m < 1$:

$$\rho_{MP}(\lambda) = \frac{1}{2\pi\beta\lambda} \Re \sqrt{(\lambda - \lambda_{\min})(\lambda_{\max} - \lambda)}, \qquad (14)$$

with

$$\lambda_{\min} = (1 - \sqrt{\beta})^2, \qquad \lambda_{\max} = (1 + \sqrt{\beta})^2. \qquad (15)$$
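
As a quick numerical companion to Eqs. (14) and (15), the sketch below evaluates the Marcenko-Pastur edges for the ratios used in the main text. The function names are hypothetical, and we take m = 35/265 as in the caption of Fig. 2; this is an illustration rather than part of the original derivation.

```python
import numpy as np

def mp_edges(beta):
    """Lower and upper edges of the Marcenko-Pastur band, Eq. (15)."""
    return (1.0 - np.sqrt(beta)) ** 2, (1.0 + np.sqrt(beta)) ** 2

def mp_density(lam, beta):
    """Marcenko-Pastur density of Eq. (14) for beta < 1 and lam > 0 (zero outside the band)."""
    lo, hi = mp_edges(beta)
    lam = np.asarray(lam, dtype=float)
    inside = np.clip((lam - lo) * (hi - lam), 0.0, None)
    return np.sqrt(inside) / (2.0 * np.pi * beta * lam)

lam_min_x, lam_max_x = mp_edges(76 / 265)   # about 0.216 and 2.358 for C_X
_, lam_max_y = mp_edges(35 / 265)           # about 1.859 for C_Y
```

These edges are the reference values against which the largest and smallest empirical eigenvalues of $C_X$ and $C_Y$ are compared in Section 4.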
The $\Sigma$-transform of this density takes a particularly simple form:

$$\Sigma(x) = \frac{1}{1 + \beta x}. \qquad (16)$$

Now, as explained in the main text, the singular values of G are obtained as
the square-root of the eigenvalues of $D = X^T X Y^T Y$. Since $X^T X$ and $Y^T Y$
are mutually free, one can again use the multiplication rule of $\Sigma$-transforms,
after having noted that the $\Sigma$-transforms of the $T \times T$ matrices $X^T X$ and
$Y^T Y$ are now given by:

$$\Sigma(x) = \frac{1}{\beta + x}. \qquad (17)$$

One therefore finds that the $\eta$-transform of D is obtained by solving the
following cubic equation for x:

$$\eta^{-1}(1 + x) = -\frac{x}{(1 + x)(n + x)(m + x)}, \qquad (18)$$

which can be done explicitly, leading to the following (lengthy) result. Denoting
$z = s^2$, one should first compute the following two functions:

$$f_1(z) = 1 + m^2 + n^2 - mn - m - n + 3z \qquad (19)$$

and

$$f_2(z) = 2 - 3m(1 - m) - 3n(1 - n) - 3mn(n + m - 4) + 2(m^3 + n^3) + 9z(1 + m + n). \qquad (20)$$

Then, form:

$$\Delta = -4 f_1(z)^3 + f_2(z)^2. \qquad (21)$$
If $\Delta > 0$, one introduces a second auxiliary variable $\Gamma$:

$$\Gamma = f_2(z) - \sqrt{\Delta}, \qquad (22)$$

to compute $\rho_2(z)$:

$$\pi \rho_2(z) = -\frac{\Gamma^{1/3}}{2^{4/3}\, 3^{1/2}\, z} + \frac{f_1(z)}{2^{2/3}\, 3^{1/2}\, \Gamma^{1/3}\, z}. \qquad (23)$$

Finally, the density $\rho(s)$ is given by:

$$\rho(s) = 2 s \rho_2(s^2). \qquad (24)$$